FIGURE 3.27: Effect of hyperparameters λ and τ on one- and two-stage training using 1-bit ResNet-18. (a) One-stage; (b) Two-stage.
which is termed the ApproxSign function and is used for the backpropagation gradient calculation of the activation. Compared to the traditional STE, ApproxSign has a shape similar to that of the original binarization function sign, so the activation gradient error can be controlled to some extent. Similarly, CBCN [149] applies an approximate function to address the gradient mismatch caused by the sign function. MetaQuant [38] introduces meta-learning to learn the gradient error of the weights with a neural network. IR-Net [196] includes a self-adaptive Error Decay Estimator (EDE) that reduces the gradient error during training by considering the different requirements of different training stages and balancing the update ability of the parameters against the reduction of gradient error. RBNN [140] proposes a training-aware approximation of the sign function for gradient backpropagation.
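To make the contrast concrete, the following is a minimal PyTorch-style sketch, an illustrative re-implementation rather than the code released with any of these methods, of a clipped STE backward versus an ApproxSign-style piecewise-quadratic backward for the sign function; the clipping range [-1, 1] and the surrogate's exact form follow the usual Bi-Real Net formulation and are assumed here for illustration.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign forward; clipped straight-through estimator (STE) backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass the incoming gradient through unchanged inside [-1, 1], zero it outside.
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)


class BinarizeApproxSign(torch.autograd.Function):
    """Sign forward; ApproxSign-style piecewise-quadratic surrogate backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of the piecewise-quadratic surrogate:
        # 2 + 2x on [-1, 0), 2 - 2x on [0, 1), and 0 elsewhere.
        surrogate = torch.where(x < 0, 2 + 2 * x, 2 - 2 * x).clamp(min=0)
        return grad_out * surrogate
```

In this sketch, replacing torch.sign(a) with BinarizeApproxSign.apply(a) keeps the binary forward pass but substitutes the surrogate derivative for the (almost everywhere zero) true gradient of sign, which is what controls the activation gradient error.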
In summary, prior art focuses on approximating the gradient derived from $\frac{\partial b_a}{\partial a_{i,j}}$ or $\frac{\partial b_w}{\partial w_{i,j}}$. Unlike these approaches, ours addresses the gradient approximation from a different perspective, i.e., the gradient from $\frac{\partial G}{\partial w_{i,j}}$. Our goal is to decouple $A$ and $w$ to improve the gradient calculation of $w$. RBONN manipulates $w$'s gradient through its bilinear coupling variable $A$ ($\frac{\partial G(A)}{\partial w_{i,j}}$). More specifically, our RBONN can be combined with the prior art by comprehensively considering $\frac{\partial L_S}{\partial a_{i,j}}$, $\frac{\partial L_S}{\partial w_{i,j}}$, and $\frac{\partial G}{\partial w_{i,j}}$ in the backpropagation process.
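As a rough illustration of this extra coupling term (the actual RBONN objective $G$ and its recurrent backtracking update are more involved than what is shown here), the sketch below assumes a per-channel reconstruction loss $G(A, w) = \|w - A \cdot \mathrm{sign}(w)\|^2$ with the scale $A$ fitted from $w$; comparing the gradient of $w$ with and without detaching $A$ isolates the contribution that flows through $\frac{\partial G(A)}{\partial w_{i,j}}$.

```python
import torch

def reconstruction_loss(w, A):
    # Illustrative bilinear reconstruction G(A, w) = ||w - A * sign(w)||^2,
    # with a per-output-channel scale A of shape (out_channels, 1).
    return ((w - A * torch.sign(w)) ** 2).sum()

# Toy latent real-valued weights (out_channels=8, fan_in=16); shapes are assumed.
w = torch.randn(8, 16, requires_grad=True)

# Per-channel scale expressed as a function of w (a simple |w|-mean fit).
A = w.abs().mean(dim=1, keepdim=True)

# Gradient when A is detached: the coupling path dG(A)/dw is discarded.
g_detached, = torch.autograd.grad(reconstruction_loss(w, A.detach()), w)

# Gradient when the dependence of A on w is kept: includes dG(A)/dw.
g_full, = torch.autograd.grad(reconstruction_loss(w, A), w)

print((g_full - g_detached).abs().max())  # nonzero gap = contribution of the coupling term
```

A nonzero difference shows that treating $A$ as a constant (the detached case) silently discards the coupling term, which is exactly the part of $w$'s gradient that RBONN manipulates.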
3.8.4 Ablation Study
Hyperparameters λ and τ. The most important hyperparameters of RBONN are λ and τ, which control the proportion of $L_R$ and the backtracking threshold in the recurrent bilinear optimization. On ImageNet with 1-bit ResNet-18, we evaluate the effect of λ and τ under both one- and two-stage training. The performance of RBONN is shown in Fig. 3.27, where λ ranges from 1e−3 to 1e−5 and τ ranges from 1 to 0.1. As observed, as λ decreases, performance first improves and then drops sharply. The same trend emerges when we increase τ in both implementations. As shown in Fig. 3.27, when λ is set to 1e−4 and τ is set to 0.6, the 1-bit ResNet-18 produced by our RBONN achieves the best performance. As